Intro
This report summarises the R-code used to derive the socio-demographic index (SDI) covariate for the GPW13 baseline and projection stream of work. It also provides the input data and generated estimates of SDI together with plots showing the variation of the estimates by country, region and over time. The SDI composite variable generated here is a version of the Socio-demographic Index (SDI) that was originally developed for the 2015 Global Burden of Disease study. It is a development status indicator comprising of components that are strongly correlated with health outcomes and is calculated as the geometric mean of normalised 0 to 1 indices of total fertility rate, mean years of schooling for those aged 15 and older, and lagged income per capita. Although there are SDI versions generated for the GBD2019 study, the input data used there are not publically available e.g. fertility. Thus, we construct the metric using publicly available fertility data from UNDESA World Population Prospects 2019 (WPP2019), as well as GDP per capita and education data from the UNDP Human Development Reports
Input data and SDI calculation
The code below reads in the income, education and fertility data mentioned above.
#################################################################################################################
# load packages
pacman::p_load(dplyr, tidyr, data.table, pspline, INLA, countrycode)
# Not in
'%!in%' <- function(x,y)!('%in%'(x,y))
# Location data by location name, iso3 and UN country code
oceania <- c("Australia and New Zealand","Melanesia", "Micronesia", "Polynesia")
loc_label<- fread("data/name_code_iso3.csv") %>%
mutate(region = countrycode(iso3, "iso3c", "region"),
region = ifelse(region %in% oceania, "Oceania", region))
isos <- sort(unique(loc_label$iso3))
#################################################################################################################
# Read in the data to be used (education, TFR and GDP)
# Raw data
educ.r <- fread("data/educ.csv", header=T)
tfr.r <- fread("data/fert.csv", header=T)
gdp.r <- fread("data/gdppc.csv", header=T)Next, the raw data are smoothed to reduce random errors and the results are used to generate the SDI variable:
#################################################################################################################
# Function to smooth the historical data for random fluctuations
smoother <- function(df, var, lagy){
setnames(df, var, "y")
dfx <- df %>% mutate(year = year + lagy) %>%
select(iso3,year,y) %>% filter(iso3 %in% isos & year>1989)
dfy <- data.table(iso3=character(), year=integer(), y=integer())
for (is in unique(dfx$iso3)){
dfz <- dfx %>% filter(iso3 == is) %>% arrange(year)
if (nrow(dfz) < 2){
dfy <- rbind(dfy, dfz)
} else {
xi = dfz$year; yi = dfz$y; xp = min(xi):max(xi)
yp <- predict(smooth.Pspline(xi, yi, spar = 1), xp)[,1]
dfy <- rbind(dfy, data.table(iso3 = is, year = xp, y = yp))
}
}
dfy %>% setnames(., "y", var) %>% arrange(iso3, year)
}
#################################################################################################################
#Smoothed data
educ <- educ.r %>% smoother(.,"educ", 0)
tfr <- tfr.r %>% smoother(.,"tfr", 0)
gdp <- gdp.r %>% smoother(.,"gdp", 5) %>% mutate(ldi=log(gdp), gdp=NULL)
# Issue with Liberia, smooth with lowess
lbr.ldi <- (gdp %>% filter(iso3 == "LBR") %>% pull(ldi) %>% lowess())$y
gdp <- gdp %>% mutate(ldi = ifelse(iso3 == "LBR", lbr.ldi, ldi))
#################################################################################################################
# Bring it all together i.e. transform variables to 0-1 scale, calculate SDI
sdi.vars <- educ %>%
left_join(tfr, by = c("iso3", "year")) %>%
left_join(gdp, by = c("iso3", "year")) %>%
left_join(loc_label, by = "iso3") %>%
select(iso3, year, ldi, tfr, educ, region) %>%
filter(!(is.na(ldi)|is.na(tfr)|is.na(educ))) %>%
mutate(educ.t = (educ - min(educ))/(max(educ) - min(educ)),
ldi.t = (ldi - min(ldi))/(max(ldi) - min(ldi)),
tfr.t = 1 - (tfr - min(tfr))/(max(tfr) - min(tfr))) %>%
rowwise() %>%
mutate(sdi.pub = round(prod(educ.t, ldi.t, tfr.t)^(1/3),3)) %>% ungroup() %>%
filter(year > 1999)
sdi.data <- list(loc_label=loc_label, educ.r=educ.r, tfr.r=tfr.r, gdp.r=gdp.r,
educ=educ, tfr=tfr, gdp=gdp, sdi.vars=sdi.vars)The raw SDI input data, smoothed intermediate files and calculated SDI can be downloaded as an R-database using the sdi.data.RDa link below:
A number of countries are missing at least one input and thus an SDI cannot be estimated for those countries directly. These countries are listed below:
## [1] "AND" "ATG" "BHS" "BLZ" "BRN" "BTN" "COK" "DMA" "ERI" "FJI" "FSM" "GRD"
## [13] "GUY" "KIR" "KNA" "MCO" "MDV" "MHL" "NIU" "NRU" "PLW" "PNG" "PRK" "SLB"
## [25] "SMR" "SOM" "SSD" "SUR" "SVK" "TLS" "TON" "TUV" "VCT" "VUT" "WSM"
For these missing countries we can impute the SDI according to a regression model. As mentioned previously, a version of the SDI has been generated for the GBD2019 using other data sources that are not publically available. The GBD2019 SDI (sdi.gbd) and the SDI from the public data (sdi.pub) can be compared:
#################################################################################################################
# sdi estimate according to GBD2019
sdi.gbd.r <- fread("data/sdigbd2019.csv", header = T)
sdi.gbd <- sdi.gbd.r %>%
gather(year, sdi.gbd, -location_name) %>%
mutate(year = as.numeric(year)) %>% filter(year > 1999) %>%
right_join(loc_label, by = "location_name")
#################################################################################################################
# Data frame with both GBD and public SDI and where possible, demographic ind
sdi.mod.df <- sdi.vars %>%
select(iso3, year, sdi.pub) %>%
right_join(sdi.gbd, by = c("iso3","year")) %>%
filter(!(iso3 %!in% missing.sdi & is.na(sdi.pub))) %>%
select(location_name, iso3, region, year, sdi.pub, sdi.gbd) %>%
arrange(iso3, year) To impute the SDI in the 35 locations listed above, a simple linear regression model is fit to sdi.pub using the sdi.gbd as well as region.
#################################################################################################################
# Regression model to predict public SDI from GBD SDI
sdi.mod <- lm(sdi.pub ~ sdi.gbd + region - 1, data = sdi.mod.df)The model estimates are shown in the Table below:
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| sdi.gbd | 1.1 | 1.0, 1.1 | <0.001 |
| region | |||
| Caribbean | — | — | |
| Caribbean | -0.02 | -0.02, -0.01 | <0.001 |
| Central America | -0.01 | -0.02, 0.00 | 0.102 |
| Central Asia | 0.00 | -0.01, 0.01 | 0.464 |
| Eastern Africa | -0.05 | -0.06, -0.05 | <0.001 |
| Eastern Asia | -0.04 | -0.05, -0.03 | <0.001 |
| Eastern Europe | -0.02 | -0.03, -0.01 | <0.001 |
| Middle Africa | -0.06 | -0.07, -0.06 | <0.001 |
| Northern Africa | -0.06 | -0.07, -0.05 | <0.001 |
| Northern America | -0.02 | -0.03, -0.01 | 0.004 |
| Northern Europe | -0.05 | -0.06, -0.04 | <0.001 |
| Oceania | -0.03 | -0.04, -0.01 | <0.001 |
| South-Eastern Asia | -0.04 | -0.05, -0.03 | <0.001 |
| South America | -0.02 | -0.02, -0.01 | <0.001 |
| Southern Africa | -0.03 | -0.04, -0.02 | <0.001 |
| Southern Asia | -0.01 | -0.02, 0.00 | 0.017 |
| Southern Europe | -0.04 | -0.05, -0.03 | <0.001 |
| Western Africa | -0.06 | -0.06, -0.05 | <0.001 |
| Western Asia | -0.06 | -0.07, -0.05 | <0.001 |
| Western Europe | -0.06 | -0.07, -0.04 | <0.001 |
| R2 = 0.9974; R2 adjusted = 0.9974 | |||
|
1
CI = Confidence Interval
|
|||
Based on the model diagnostics and assessing the model fit by comparing the fitted to the observed values, we see that in general, the model fits the data well.
The model is used to impute SDI where missing:
#################################################################################################################
# Imputing SDI using the linear regression model on sdi.gbd
sdi.out <- sdi.mod.df %>%
mutate(preds = predict(sdi.mod, sdi.mod.df),
sdi.pub = ifelse(is.na(sdi.pub) | iso3 == "LBR", preds, sdi.pub)) %>%
select(location_name, iso3, year, sdi.pub, region) Forecasting the SDI covariate
The goal being to potentially forecast indicator values for the GPW13 and where missing, to impute them using the SDI as a covariate, an initial step is to complete the SDI time-series to cover the entire range relevant to the GPW13. This is accomplished by forecasting the SDI for the years applicable to the GPW13 (also back-casting where historical data are non-existent). In this case we forecast the SDI time-series using a random walk (RW2) model which has the density \[\pi(y) \propto exp\left(-\frac{1}{2}\sum_{i=3}^n (y_i - 2y_{i-1} + y_{i-2})^2\right)\] where the term \(y_{i+1} −2y_i +y_{i−1}\) can be interpreted as an estimate of the second order derivative of a continuous time function \(y(t)\) at \(t = i\) using values of \(y(t)\) at \(t = i − 1\), \(i\), and \(i+1\).
#################################################################################################################
# Function to forecast a series using time-series random walk
forecast.series <- function(dmod){
result <- inla(sdi.pub ~ f(x, model = "rw2"), data=dmod,
control.predictor= list(compute=TRUE))
tibble(pred = round(result$summary.fitted.values$mean, 3),
lwr = round(result$summary.fitted.values$"0.025quant", 3),
uppr = round(result$summary.fitted.values$"0.975quant", 3))
}
#################################################################################################################
# Loop performing forecast by country
isg <- unique(sdi.out$iso3)
out <- list(dim = length(isg))
dum <- tibble(x = 1:26, sdi.pub = NA)
for (ig in 1:length(isg)){
covar.db <- sdi.out %>% filter(iso3 == isg[ig]) %>% arrange(year)
cov.lab <- covar.db %>% select(location_name, iso3, region) %>% distinct()
dmod <- covar.db %>% mutate(x = year - 1999) %>% select(x, sdi.pub)
dmod <- dmod %>% rbind(dum %>% filter(x %!in% dmod$x)) %>% arrange(x)
preds <- forecast.series(dmod)
out[[ig]] <- data.table(cov.lab, year=2000:2025, sdi.pub=dmod$sdi.pub, preds)
}
forecasted.sdi <- rbindlist(out)In addition to SDI, we also take life-expectancy as a potential covariate. From the UNPOP division WPP2019, life-expectancy at birth has aleady been forecasted for 183 of the 194 member states. We read in the data:
#################################################################################################################
# life-expactancy indicator from WPP2019
e0.r <- fread("data/demogind.csv")
e0.in <- e0.r %>%
right_join(loc_label, by = "country_code") %>%
filter(!(Variant =="Medium variant" & year ==2020)) %>%
filter(year %in% 2000:2025) %>% distinct() %>%
select(iso3, year, e0) And also provide a list the countries missing \(e_0\):
## [1] "AND" "COK" "DMA" "KNA" "MCO" "MHL" "NIU" "NRU" "PLW" "SMR" "TUV"
We can impute life-expectancy using the SDI, again utilizing a basic linear model (also noting that we are effectively imposing a collinearity for the countries missing both SDI and \(e_0\)):
#################################################################################################################
# Predict e0 using the SDI and impute for the missing 11 locations
cov.in <- forecasted.sdi %>% left_join(e0.in, by=c("iso3","year")) %>% rename(sdi=pred)
m.e0 <- lm(e0 ~ sdi + region - 1, data = cov.in)
p.e0 <- predict(m.e0, cov.in)The model estimates are shown in the Table below:
| Characteristic | Beta | 95% CI1 | p-value |
|---|---|---|---|
| sdi | 33 | 32, 33 | <0.001 |
| region | |||
| Caribbean | — | — | |
| Caribbean | 52 | 51, 52 | <0.001 |
| Central America | 55 | 55, 56 | <0.001 |
| Central Asia | 48 | 47, 49 | <0.001 |
| Eastern Africa | 50 | 49, 50 | <0.001 |
| Eastern Asia | 53 | 52, 54 | <0.001 |
| Eastern Europe | 49 | 48, 49 | <0.001 |
| Middle Africa | 45 | 45, 46 | <0.001 |
| Northern Africa | 53 | 52, 54 | <0.001 |
| Northern America | 52 | 50, 53 | <0.001 |
| Northern Europe | 52 | 51, 53 | <0.001 |
| Oceania | 52 | 51, 53 | <0.001 |
| South-Eastern Asia | 52 | 51, 53 | <0.001 |
| South America | 53 | 52, 54 | <0.001 |
| Southern Africa | 38 | 37, 39 | <0.001 |
| Southern Asia | 54 | 53, 55 | <0.001 |
| Southern Europe | 54 | 53, 55 | <0.001 |
| Western Africa | 49 | 49, 49 | <0.001 |
| Western Asia | 53 | 52, 54 | <0.001 |
| Western Europe | 53 | 52, 54 | <0.001 |
| R2 = 0.9975; R2 adjusted = 0.9975 | |||
|
1
CI = Confidence Interval
|
|||
Finally, we bring all the files together to create the covariate database complete for all countries and years 2000 to 2025:
#################################################################################################################
# Covariates data-frame
covariates <- cov.in %>%
mutate(p.e0s = p.e0, e0 = ifelse(is.na(e0), p.e0s, e0)) %>%
select(location_name, iso3, year, region, sdi.pub, sdi, lwr, uppr, e0) %>% distinct() %>%
filter(year %in% 2000:2025) %>% arrange(iso3, year) %>%
mutate(SDI.class = ifelse(iso3 %in% missing.sdi, "Imputed", "Public data"),
e0.class = ifelse(iso3 %in% missing.e0, "Imputed", "Public data"))
#################################################################################################################
# List to download
covariate.data <- list(sdi.gbd.r=sdi.gbd.r, sdi.gbd=sdi.gbd, sdi.mod.df=sdi.mod.df, sdi.out=sdi.out,
forecasted.sdi=forecasted.sdi,
e0.r=e0.r, e0.in = e0.in, cov.in = cov.in, covariates= covariates)All of the covariate raw and intermediate files and the final covariates dataframe can be downloaded as an R-database using the covariate.data.RDa link below:
The plot below illustrates the estimated and forecasted SDI by country. Observed values till the year 2017 are represented by the golden circles. The purple lines show the forecast described above with the uncertainty range depicted using the same colour. The plots also show the time-series of life-expectancy at birth \(e_0\) with a dashed series in dark green and the relevant axis to the right:
Maps for 2023
We can also inspect the geographic distribution of the values of these covariates that is forecasted for the 2023. Beginning with SDI:
And also showing life-expectancy at birth: